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Abstract 

We  propose  a  theory  that  gives  formal  seman¬ 
tics  to  word-level  alignments  defined  over  par¬ 
allel  corpora.  We  use  our  theory  to  introduce  a 
linear  algorithm  that  can  be  used  to  derive  from 
word-aligned,  parallel  corpora  the  minimal  set 
of  syntactically  motivated  transformation  rules 
that  explain  human  translation  data. 

1  Introduction 

In  a  very  interesting  study  of  syntax  in  statistical  machine 
translation.  Fox  (2002)  looks  at  how  well  proposed  trans¬ 
lation  models  fit  actual  translation  data.  One  such  model 
embodies  a  restricted,  linguistically-motivated  notion  of 
word  re-ordering.  Given  an  English  parse  tree,  children 
at  any  node  may  be  reordered  prior  to  translation.  Nodes 
are  processed  independently.  Previous  to  Fox  (2002),  it 
had  been  observed  that  this  model  would  prohibit  certain 
re-orderings  in  certain  language  pairs  (such  as  subject- 
VP(verb-object)  into  verb-subject-object),  but  Fox  car¬ 
ried  out  the  first  careful  empirical  study,  showing  that 
many  other  common  translation  patterns  fall  outside  the 
scope  of  the  child-reordering  model.  This  is  true  even 
for  languages  as  similar  as  English  and  French.  For 
example,  English  adverbs  tend  to  move  outside  the  lo¬ 
cal  parent/children  in  environment.  The  English  word 
“not”  translates  to  the  discontiguous  pair  “ne  ...  pas.” 
English  parsing  errors  also  cause  trouble,  as  a  normally 
well-behaved  re-ordering  environment  can  be  disrupted 
by  wrong  phrase  attachment.  For  other  language  pairs, 
the  divergence  is  expected  to  be  greater. 

In  the  face  of  these  problems,  we  may  choose  among 
several  alternatives.  The  first  is  to  abandon  syntax  in 
statistical  machine  translation,  on  the  grounds  that  syn¬ 
tactic  models  are  a  poor  fit  for  the  data.  On  this  view, 
adding  syntax  yields  no  improvement  over  robust  phrase- 
substitution  models,  and  the  only  question  is  how  much 


does  syntax  hurt  performance.  Along  this  line,  (Koehn 
et  al.,  2003)  present  convincing  evidence  that  restricting 
phrasal  translation  to  syntactic  constituents  yields  poor 
translation  performance  -  the  ability  to  translate  non¬ 
constituent  phrases  (such  as  “there  are”,  “note  that”,  and 
“according  to”)  turns  out  to  be  critical  and  pervasive. 

Another  direction  is  to  abandon  conventional  English 
syntax  and  move  to  more  robust  grammars  that  adapt  to 
the  parallel  training  corpus.  One  approach  here  is  that  of 
Wu  (1997),  in  which  word-movement  is  modeled  by  rota¬ 
tions  at  unlabeled,  binary-branching  nodes.  At  each  sen¬ 
tence  pair,  the  parse  adapts  to  explain  the  translation  pat¬ 
tern.  If  the  same  unambiguous  English  sentence  were  to 
appear  twice  in  the  corpus,  with  different  Chinese  trans¬ 
lations,  then  it  could  have  different  learned  parses. 

A  third  direction  is  to  maintain  English  syntax  and 
investigate  alternate  transformation  models.  After  all, 
many  conventional  translation  systems  are  indeed  based 
on  syntactic  transformations  far  more  expressive  than 
what  has  been  proposed  in  syntax-based  statistical  MT. 
We  take  this  approach  in  our  paper.  Of  course,  the  broad 
statistical  MT  program  is  aimed  at  a  wider  goal  than 
the  conventional  rule-based  program  -  it  seeks  to  under¬ 
stand  and  explain  human  translation  data,  and  automati¬ 
cally  learn  from  it.  For  this  reason,  we  think  it  is  impor¬ 
tant  to  learn  from  the  model/data  explainability  studies  of 
Fox  (2002)  and  to  extend  her  results.  In  addition  to  being 
motivated  by  rule-based  systems,  we  also  see  advantages 
to  English  syntax  within  the  statistical  framework,  such 
as  marrying  syntax-based  translation  models  with  syntax- 
based  language  models  (Charniak  et  al.,  2003)  and  other 
potential  benefits  described  by  Eisner  (2003). 

Our  basic  idea  is  to  create  transformation  rules  that 
condition  on  larger  fragments  of  tree  structure.  It  is 
certainly  possible  to  build  such  rules  by  hand,  and  we 
have  done  this  to  formally  explain  a  number  of  human- 
translation  examples.  But  our  main  interest  is  in  collect¬ 
ing  a  large  set  of  such  rules  automatically  through  corpus 
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analysis.  The  search  for  these  rules  is  driven  exactly  by 
the  problems  raised  by  Fox  (2002)  -  cases  of  crossing 
and  divergence  motivate  the  algorithms  to  come  up  with 
better  explanations  of  the  data  and  better  rules.  Section 
2  of  this  paper  describes  algorithms  for  the  acquisition 
of  complex  rules  for  a  transformation  model.  Section  3 
gives  empirical  results  on  the  explanatory  power  of  the 
acquired  rules  versus  previous  models.  Section  4  presents 
examples  of  learned  rules  and  shows  the  various  types  of 
transformations  (lexical  and  nonlexical,  contiguous  and 
noncontiguous,  simple  and  complex)  that  the  algorithms 
are  forced  (by  the  data)  to  invent.  Section  5  concludes. 
Due  to  space  constraints,  all  proofs  are  omitted. 

2  Rule  Acquisition 

Suppose  that  we  have  a  French  sentence,  its  translation 
into  English,  and  a  parse  tree  over  the  English  translation, 
as  shown  in  Eigure  1 .  Generally  one  defines  an  alignment 
as  a  relation  between  the  words  in  the  Erench  sentence 
and  the  words  in  the  English  sentence.  Given  such  an 
alignment  however,  what  kinds  of  rules  are  we  entitled 
to  learn  from  this  instance?  How  do  we  know  when  it  is 
valid  to  extract  a  particular  rule,  especially  in  the  pres¬ 
ence  of  numerous  crossings  in  the  alignment?  In  this  sec¬ 
tion,  we  give  principled  answers  to  these  questions,  by 
constructing  a  theory  that  gives  formal  semantics  to  word 
alignments. 
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Eigure  1:  A  Erench  sentence  aligned  with  an  English 
parse  tree. 
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2.1  A  Theory  of  Word  Alignments 

We  are  going  to  define  a  generative  process  through 
which  a  string  from  a  source  alphabet  is  mapped  to  a 
rooted  tree  whose  nodes  are  labeled  from  a  target  alpha¬ 
bet.  Henceforth  we  will  refer  to  symbols  from  our  source 
alphabet  as  source  symbols  and  symbols  from  our  target 
alphabet  as  target  symbols.  We  define  a  symbol  tree  over 
an  alphabet  A  as  a  rooted,  directed  tree,  the  nodes  of 
which  are  each  labeled  with  a  symbol  of  A. 

We  want  to  capture  the  process  by  which  a  symbol  tree 
over  the  target  language  is  derived  from  a  string  of  source 
symbols.  Let  us  refer  to  the  symbol  tree  that  we  want  to 
derive  as  the  target  tree.  Any  subtree  of  this  tree  will  be 
called  a  target  subtree.  Eurthermore,  we  define  a  deriva¬ 
tion  string  as  an  ordered  sequence  of  elements,  each  of 
which  is  either  a  source  symbol  or  a  target  subtree. 

Now  we  are  ready  to  define  the  derivation  process. 
Given  a  derivation  string  S,  a  derivation  step  replaces 
a  substring  S'  of  S  with  a  target  subtree  T  that  has  the 
following  properties: 

1 .  Any  target  subtree  in  S'  is  a  subtree  of  T. 


Eigure  2:  Three  alternative  derivations  from  a  source  sen¬ 
tence  to  a  target  tree. 

Moreover,  a  derivation  from  a  string  S  of  source  sym¬ 
bols  to  the  target  tree  T  is  a  sequence  of  derivation  steps 
that  produces  T  from  S. 

Moving  away  from  the  abstract  for  a  moment,  let  us 
revisit  the  example  from  Eigure  1 .  Eigure  2  shows  three 
derivations  of  the  target  tree  from  the  source  string  “il 
ne  va  pas”,  which  are  all  consistent  with  our  defini¬ 
tions.  However,  it  is  apparent  that  one  of  these  deriva¬ 
tions  seems  much  more  “wrong”  than  the  other.  Specif¬ 
ically,  in  the  second  derivation,  “pas”  is  replaced  by  the 
English  word  “he,”  which  makes  no  sense.  Given  the  vast 
space  of  possible  derivations  (according  to  the  definition 
above),  how  do  we  distinguish  between  good  ones  and 
bad  ones?  Here  is  where  the  notion  of  an  alignment  be¬ 
comes  useful. 

Let  S'  be  a  string  of  source  symbols  and  let  T  be  a  target 
tree.  Eirst  observe  the  following  facts  about  derivations 
from  S  to  T  (these  follow  directly  from  the  definitions): 


2.  Any  target  subtree  in  S  but  not  in  S'  does  not  share 
nodes  with  T. 


1 .  Each  element  of  S  is  replaced  at  exactly  one  step  of 
the  derivation. 
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2.  Each  node  of  T  is  created  at  exactly  one  step  of  the 
derivation. 

Thus  for  each  element  s  of  S,  we  can  define 
replaced{s,  D)  to  be  the  step  of  the  derivation  D  during 
which  s  is  replaced.  For  instance,  in  the  leftmost  deriva¬ 
tion  of  Figure  2,  “va”  is  replaced  by  the  second  step  of  the 
derivation,  thus  replaced{va,  D)  =  2.  Similarly,  for  each 
node  t  of  T,  we  can  define  created{t,  D)  to  be  the  step 
of  derivation  D  during  which  t  is  created.  For  instance, 
in  the  same  derivation,  the  nodes  labeled  by  “AUX”  and 
“VP”  are  created  during  the  third  step  of  the  derivation, 
thus  created{AUX,  D)  =  3  and  created{VP,  D)  =  3. 

Given  a  string  S  of  source  symbols  and  a  target  tree 
T,  an  alignment  A  with  respect  to  S  and  T  is  a  relation 
between  the  leaves  of  T  and  the  elements  of  S.  Choose 
some  derivation  D  from  S  to  T.  The  alignment  A  in¬ 
duced  by  D  is  created  as  follows:  an  element  s  of  S'  is 
aligned  with  a  leaf  node  t  of  T  iff  replaced{s,  D)  = 
created{t,  D).  In  other  words,  a  source  word  is  aligned 
with  a  target  word  if  the  target  word  is  created  during  the 
same  step  in  which  the  source  word  is  replaced.  Figure  3 
shows  the  alignments  induced  by  the  derivations  of  Fig¬ 
ure  2. 

Now,  say  that  we  have  a  source  string,  a  target  tree, 
and  an  alignment  A.  A  key  observation  is  that  the  set 
of  “good”  derivations  according  to  A  is  precisely  the  set 
of  derivations  that  induce  alignments  A!  such  that  A  is 
a  subalignment  of  A' .  By  subalignment,  we  mean  that 
A  'A  A!  (recall  that  alignments  are  simple  mathematical 
relations).  In  other  words,  A  is  a  subalignment  of  Ad  if  A 
aligns  two  elements  only  if  Ad  also  aligns  them. 

We  can  see  this  intuitively  by  examining  Figures  2  and 
3.  Notice  that  the  two  derivations  that  seem  “right”  (the 
first  and  the  third)  are  superalignments  of  the  alignment 
given  in  Figure  1,  while  the  derivation  that  is  clearly 
wrong  is  not.  Hence  we  now  have  a  formal  definition 
of  the  derivations  that  we  are  interested  in.  We  say  that 
a  derivation  is  admitted  by  an  alignment  A  if  it  induces  a 
superalignment  of  A.  The  set  of  derivations  from  source 
string  S  to  target  tree  T  that  are  admitted  by  alignment  A 
can  be  denoted  Sa{S,  T).  Given  this,  we  are  ready  to  ob¬ 
tain  a  formal  characterization  of  the  set  of  rules  that  can 
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Figure  4:  Two  derivation  steps  and  the  rules  that  are  in¬ 
duced  from  them. 


be  inferred  from  the  source  string,  target  tree,  and  align¬ 
ment. 

2.2  From  Derivations  to  Rules 

In  essence,  a  derivation  step  can  be  viewed  as  the  applica¬ 
tion  of  a  rule.  Thus,  compiling  the  set  of  derivation  steps 
used  in  any  derivation  of  5^(5',  T)  gives  us,  in  a  mean¬ 
ingful  sense,  all  relevant  rules  that  can  be  extracted  from 
the  triple  {S,  T,  A).  In  this  section,  we  show  in  concrete 
terms  how  to  convert  a  derivation  step  into  a  usable  rule. 

Consider  the  second-last  derivation  step  of  the  first 
derivation  in  Figure  2.  In  it,  we  begin  with  a  source  sym¬ 
bol  “ne”,  followed  by  a  target  subtree  rooted  at  V B,  fol¬ 
lowed  by  another  source  symbol  “pas.”  These  three  ele¬ 
ments  of  the  derivation  string  are  replaced  with  a  target 
subtree  rooted  at  VP  that  discards  the  source  symbols 
and  contains  the  target  subtree  rooted  atVB.  In  general, 
this  replacement  process  can  be  captured  by  the  rule  de¬ 
picted  in  Figure  4.  The  input  to  the  rule  are  the  roots 
of  the  elements  of  the  derivation  string  that  are  replaced 
(where  we  define  the  root  of  a  symbol  to  be  simply  the 
symbol  itself),  whereas  the  output  of  the  rule  is  a  symbol 
tree,  except  that  some  of  the  leaves  are  labeled  with  vari¬ 
ables  instead  of  symbols  from  the  target  alphabet.  These 
variables  correspond  to  elements  of  the  input  to  the  rule. 
For  instance,  the  leaf  labeled  x2  means  that  when  this  rule 
is  applied,  x2  is  replaced  by  the  target  subtree  rooted  at 
VB  (since  VB  is  the  second  element  of  the  input).  Ob¬ 
serve  that  the  second  rule  induced  in  Figure  4  is  simply 
a  CFG  rule  expressed  in  the  opposite  direction,  thus  this 
rule  format  can  (and  should)  be  viewed  as  a  strict  gener¬ 
alization  of  CFG  rules. 
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Figure  5:  An  alignment  graph.  The  nodes  are  annotated 
with  their  spans.  Nodes  in  the  frontier  set  are  boldfaced 
and  italicized. 


Figure  6:  Two  frontier  graph  fragments  and  the  rules  in¬ 
duced  from  them.  Observe  that  the  spans  of  the  sink 
nodes  form  a  partition  of  the  span  of  the  root. 


Every  derivation  step  can  be  mapped  to  a  rule  in  this 
way.  Hence  given  a  source  string  S,  a  target  tree  T,  and 
an  alignment  A,  we  can  define  the  set  pa{S,T)  as  the  set 
of  rules  in  any  derivation  D  €  Sa{S,  T).  We  can  regard 
this  as  the  set  of  rules  that  we  are  entitled  to  infer  from 
the  triple  {S,T,  A). 

2.3  Inferring  Complex  Rules 

Now  we  have  a  precise  problem  statement:  learn  the  set 
Pa{S,  T).  It  is  not  immediately  clear  how  such  a  set  can 
be  learned  from  the  triple  (S',  T,  A).  Fortunately,  we  can 
infer  these  rules  directly  from  a  structure  called  an  align¬ 
ment  graph.  In  fact,  we  have  already  seen  numerous  ex¬ 
amples  of  alignment  graphs.  Graphically,  we  have  been 
depicting  the  triple  (S,  T,  A)  as  a  rooted,  directed,  acyclic 
graph  (where  direction  is  top-down  in  the  diagrams).  We 
refer  to  such  a  graph  as  an  alignment  graph.  Formally, 
the  alignment  graph  corresponding  to  S,  T,  and  A  is  just 
T,  augmented  with  a  node  for  each  element  of  S,  and 
edges  from  leaf  node  t  €  T  to  element  s  S  S  iff  A  aligns 
s  with  t.  Although  there  is  a  difference  between  a  node 
of  the  alignment  graph  and  its  label,  we  will  not  make  a 
distinction,  to  ease  the  notational  burden. 

To  make  the  presentation  easier  to  follow,  we  assume 
throughout  this  section  that  the  alignment  graph  is  con¬ 
nected,  i.e.  there  are  no  unaligned  elements.  All  of  the 
results  that  follow  have  generalizations  to  deal  with  un¬ 
aligned  elements,  but  unaligned  elements  incur  certain 
procedural  complications  that  would  cloud  the  exposi¬ 
tion. 

It  turns  out  that  it  is  possible  to  systematically  con¬ 
vert  certain  fragments  of  the  alignment  graph  into  rules 
of  pa{S,  T).  We  define  a  fragment  of  a  directed,  acyclic 
graph  G  to  be  a  nontrivial  (i.e.  not  just  a  single  node)  sub¬ 
graph  G'  of  G  such  that  if  a  node  n  is  in  G'  then  either  n 
is  a  sink  node  of  G'  (i.e.  it  has  no  children)  or  all  of  its 
children  are  in  G'  (and  it  is  connected  to  all  of  them).  In 


Figure  6,  we  show  two  examples  of  graph  fragments  of 
the  alignment  graph  of  Figure  5. 

The  span  of  a  node  n  of  the  alignment  graph  is  the 
subset  of  nodes  from  S  that  are  reachable  from  n.  Note 
that  this  definition  is  similar  to,  but  not  quite  the  same 
as,  the  definition  of  a  span  given  by  Fox  (2002).  We 
say  that  a  span  is  contiguous  if  it  contains  all  elements 
of  a  contiguous  substring  of  S.  The  closure  of  span{n) 
is  the  shortest  contiguous  span  which  is  a  superset  of 
spanin).  For  instance,  the  closure  of  {52,53,55,57} 
would  be  {52,  53,  54,  55,  56,  57}  The  alignment  graph  in 
Figure  5  is  annotated  with  the  span  of  each  node. 

Take  a  look  at  the  graph  fragments  in  Figure  6.  These 
fragments  are  special:  they  are  examples  of  frontier 
graph  fragments.  We  first  define  the  frontier  set  of  an 
alignment  graph  to  be  the  set  of  nodes  n  that  satisfy  the 
following  property:  for  every  node  n'  of  the  alignment 
graph  that  is  connected  to  n  but  is  neither  an  ancestor  nor 
a  descendant  of  n,  span{n')  n  closure{span{n))  =  0. 

We  then  define  a  frontier  graph  fragment  of  an  align¬ 
ment  graph  to  be  a  graph  fragment  such  that  the  root  and 
all  sinks  are  in  the  frontier  set.  Frontier  graph  fragments 
have  the  property  that  the  spans  of  the  sinks  of  the  frag¬ 
ment  are  each  contiguous  and  form  a  partition  of  the  span 
of  the  root,  which  is  also  contiguous.  This  allows  the  fol¬ 
lowing  transformation  process: 

1 .  Place  the  sinks  in  the  order  defined  by  the  partition 
(i.e.  the  sink  whose  span  is  the  first  part  of  the  span 
of  the  root  goes  first,  the  sink  whose  span  is  the  sec¬ 
ond  part  of  the  span  of  the  root  goes  second,  etc.). 
This  forms  the  input  of  the  rule. 

2.  Replace  sink  nodes  of  the  fragment  with  a  variable 
corresponding  to  their  position  in  the  input,  then 
take  the  tree  part  of  the  fragment  (i.e.  project  the 
fragment  on  T).  This  forms  the  output  of  the  rule. 
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Figure  6  shows  the  rules  derived  from  the  given  graph 
fragments.  We  have  the  following  result. 

Theorem  1  Rules  constructed  according  to  the  above 
procedure  are  in  pa(S,  T). 

Rule  extraction:  Algorithm  1.  Thus  we  now  have  a 
simple  method  for  extracting  rules  of  pa{S,  T)  from  the 
alignment  graph:  search  the  space  of  graph  fragments  for 
frontier  graph  fragments. 

Unfortunately,  the  search  space  of  all  fragments  of  a 
graph  is  exponential  in  the  size  of  the  graph,  thus  this 
procedure  can  also  take  a  long  time  to  execute.  To  ar¬ 
rive  at  a  much  faster  procedure,  we  take  advantage  of  the 
following  provable  facts: 

1.  The  frontier  set  of  an  alignment  graph  can  be  identi- 
hed  in  time  linear  in  the  size  of  the  graph. 

2.  For  each  node  n  of  the  frontier  set,  there  is  a  unique 
minimal  frontier  graph  fragment  rooted  at  n  (ob¬ 
serve  that  for  any  node  n'  not  in  the  frontier  set, 
there  is  no  frontier  graph  fragment  rooted  at  n',  by 
dehnition). 

By  minimal,  we  mean  that  the  frontier  graph  fragment 
is  a  subgraph  of  every  other  frontier  graph  fragment  with 
the  same  root.  Clearly,  for  an  alignment  graph  with  k 
nodes,  there  are  at  most  k  minimal  frontier  graph  frag¬ 
ments.  In  Figure  7,  we  show  the  seven  minimal  frontier 
graph  fragments  of  the  alignment  graph  of  Figure  5.  Fur¬ 
thermore,  all  other  frontier  graph  fragments  can  be  cre¬ 
ated  by  composing  2  or  more  minimal  graph  fragments, 
as  shown  in  Figure  8.  Thus,  the  entire  set  of  frontier  graph 
fragments  (and  all  rules  derivable  from  these  fragments) 
can  be  computed  systematically  as  follows:  compute  the 
set  of  minimal  frontier  graph  fragments,  compute  the  set 
of  graph  fragments  resulting  from  composing  2  minimal 
frontier  graph  fragments,  compute  the  set  of  graph  frag¬ 
ments  resulting  from  composing  3  minimal  graph  frag¬ 
ments,  etc.  In  this  way,  the  rules  derived  from  the  min¬ 
imal  frontier  graph  fragments  can  be  regarded  as  a  ba¬ 
sis  for  all  other  rules  derivable  from  frontier  graph  frag¬ 
ments.  Furthermore,  we  conjecture  that  the  set  of  rules 
derivable  from  frontier  graph  fragments  is  in  fact  equiva¬ 
lent  to  pa{S,  T). 

Thus  we  have  boiled  down  the  problem  of  extracting 
complex  rules  to  the  following  simple  problem:  hnd  the 
set  of  minimal  frontier  graph  fragments  of  a  given  align¬ 
ment  graph. 

The  algorithm  is  a  two-step  process,  as  shown  below. 
Rule  extraction:  Algorithm  2 

1.  Compute  the  frontier  set  of  the  alignment  graph. 

2.  For  each  node  of  the  frontier  set,  compute  the  mini¬ 
mal  frontier  graph  fragment  rooted  at  that  node. 
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Figure  7 :  The  seven  minimal  frontier  graph  fragments  of 
the  alignment  graph  in  Figure  5 
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Figure  8:  Example  compositions  of  minimal  frontier 
graph  fragments  into  larger  frontier  graph  fragments. 


Step  1  can  be  computed  in  a  single  traversal  of  the 
alignment  graph.  This  traversal  annotates  each  node  with 
its  span  and  its  complement  span.  The  complement  span 
is  computed  as  the  union  of  the  complement  span  of  its 
parent  and  the  span  of  all  its  siblings  (siblings  are  nodes 
that  share  the  same  parent).  A  node  n  is  in  the  frontier 
set  iff  complement ^pan{n)  n  closure{span{n))  =  0. 
Notice  that  the  complement  span  merely  summarizes  the 
spans  of  all  nodes  that  are  neither  ancestors  nor  descen- 
dents  of  n.  Since  this  step  requires  only  a  single  graph 
traversal,  it  runs  in  linear  time. 

Step  2  can  also  be  computed  straightforwardly.  For 
each  node  n  of  the  frontier  set,  do  the  following:  expand 
n,  then  as  long  as  there  is  some  sink  node  n'  of  the  result¬ 
ing  graph  fragment  that  is  not  in  the  frontier  set,  expand 
n' .  Note  that  after  computing  the  minimal  graph  frag¬ 
ment  rooted  at  each  node  of  the  frontier  set,  every  node 
of  the  alignment  graph  has  been  expanded  at  most  once. 
Thus  this  step  also  runs  in  linear  time. 

For  clarity  of  exposition  and  lack  of  space,  a  couple  of 
issues  have  been  glossed  over.  Briefly: 

•  As  previously  stated,  we  have  ignored  here  the  is¬ 
sue  of  unaligned  elements,  but  the  procedures  can 
be  easily  generalized  to  accommodate  these.  The 


results  of  the  next  two  sections  are  all  based  on  im¬ 
plementations  that  handle  unaligned  elements. 

•  This  theory  can  be  generalized  quite  cleanly  to  in¬ 
clude  derivations  for  which  substrings  are  replaced 
by  sets  of  trees,  rather  than  one  single  tree.  This 
corresponds  to  allowing  rules  that  do  not  require  the 
output  to  be  a  single,  rooted  tree.  Such  a  general¬ 
ization  gives  some  nice  power  to  effectively  explain 
certain  linguistic  phenomena.  For  instance,  it  allows 
us  to  immediately  translate  “va”  as  “does  go”  in¬ 
stead  of  delaying  the  creation  of  the  auxiliary  word 
“does”  until  later  in  the  derivation. 

3  Experiments 

3.1  Language  Choice 

We  evaluated  the  coverage  of  our  model  of  transforma¬ 
tion  rules  with  two  language  pairs:  English-French  and 
English-Chinese.  These  two  pairs  clearly  contrast  by 
the  underlying  difficulty  to  understand  and  model  syntac¬ 
tic  transformations  among  pairs:  while  there  is  arguably 
a  fair  level  of  cohesion  between  English  and  Erench, 
English  and  Chinese  are  syntactically  more  distant  lan¬ 
guages.  We  also  chose  Erench  to  compare  our  study  with 
that  of  Eox  (2002).  The  additional  language  pair  provides 
a  good  means  of  evaluating  how  our  transformation  rule 
extraction  method  scales  to  more  problematic  language 
pairs  for  which  child-reordering  models  are  shown  not  to 
explain  the  data  well. 

3.2  Data 

We  performed  experiments  with  two  corpora,  the  EBIS 
English-Chinese  Parallel  Text  and  the  Hansard  Erench- 
English  corpus. We  parsed  the  English  sentences  with 
a  state-of-the-art  statistical  parser  (Collins,  1999).  Eor 
the  EBIS  corpus  (representing  eight  million  English 
words),  we  automatically  generated  word-alignments  us¬ 
ing  GIZAh-h-  (Och  and  Ney,  2003),  which  we  trained  on 
a  much  larger  data  set  (150  million  words).  Cases  other 
than  one-to-one  sentence  mappings  were  eliminated.  Eor 
the  Hansard  corpus,  we  took  the  human  annotation  of 
word  alignment  described  in  (Och  and  Ney,  2000).  The 
corpus  contains  two  kinds  of  alignments:  S  (sure)  for 
unambiguous  cases  and  P  (possible)  for  unclear  cases, 
e.g.  idiomatic  expressions  and  missing  function  words 
(S  C  P).  In  order  to  be  able  to  make  legitimate  com¬ 
parisons  between  the  two  language  pairs,  we  also  used 
GIZA-h-  to  obtain  machine-generated  word  alignments 
for  Hansard:  we  trained  it  with  the  500  sentences  and 
additional  data  representing  13.7  million  English  words 
(taken  from  the  Hansard  and  European  parliament  cor¬ 
pora). 


3.3  Results 

Erom  a  theoretical  point  of  view,  we  have  shown  that  our 
model  can  fully  explain  the  transformation  of  any  parse 
tree  of  the  source  language  into  a  string  of  the  target  lan¬ 
guage.  The  purpose  of  this  section  is  twofold:  to  pro¬ 
vide  quantitative  results  confirming  the  full  coverage  of 
our  model  and  to  analyze  some  properties  of  the  trans¬ 
formation  rules  that  support  these  derivations  (linguistic 
analyses  of  these  rules  are  presented  in  the  next  section). 

Eigure  9  summarizes  the  coverage  of  our  model  with 
respect  to  the  Hansard  and  EBIS  corpora.  Eor  the  for¬ 
mer,  we  present  results  for  the  three  alignments:  S  align¬ 
ments,  P  alignments,  and  the  alignments  computed  by 
GIZA-h-.  Each  plotted  value  represents  a  percentage  of 
parse  trees  in  a  corpus  that  can  be  transformed  into  a  tar¬ 
get  sentence  using  transformation  rules.  The  x-axis  rep¬ 
resents  different  restrictions  on  the  size  of  these  rules:  if 
we  use  a  model  that  restrict  rules  to  a  single  expansion 
of  a  non-terminal  into  a  sequence  of  symbols,  we  are  in 
the  scope  of  the  child-reordering  model  of  (Yamada  and 
Knight,  2001;  Eox,  2002).  We  see  that  its  explanatory 
power  is  quite  poor,  with  only  19.4%,  14.3%,  16.5%,  and 
12.1%  (for  the  respective  corpora).  Allowing  more  ex¬ 
pansions  logically  expands  the  coverage  of  the  model, 
until  the  point  where  it  is  total:  transformation  rules  no 
larger  than  17,  18,  23,  and  43  (in  number  of  rule  expan¬ 
sions)  respectively  provide  enough  coverage  to  explain 
the  data  at  100%  for  each  of  the  four  cases. 

It  appears  from  the  plot  that  the  quality  of  alignments 
plays  an  important  role.  If  we  compare  the  three  kinds  of 
alignments  available  for  the  Hansard  corpus,  we  see  that 
much  more  complex  transformation  rules  are  extracted 
from  noisy  GIZA-h-  alignments.  It  also  appears  that  the 
language  difference  produces  quite  contrasting  results. 
Rules  acquired  for  the  English-Chinese  pair  have,  on  av¬ 
erage,  many  more  nodes.  Note  that  the  language  differ¬ 
ence  in  terms  of  syntax  might  be  wider  than  what  the  plot 
seems  to  indicate,  since  word  alignments  computed  for 
the  Hansard  corpus  are  likely  to  be  more  errorful  than  the 
ones  for  EBIS  because  the  training  data  used  to  induce  the 
latter  is  more  than  ten  times  larger  than  for  the  former. 

In  Eigure  10,  we  show  the  explanatory  power  of  our 
model  at  the  node  level.  At  each  node  of  the  frontier 
set,  we  determine  whether  it  is  possible  to  extract  a  rule 
that  doesn’t  exceed  a  given  limit  k  on  its  size.  The  plot¬ 
ted  values  represent  the  percentage  of  frontier  set  inter¬ 
nal  nodes  that  satisfy  this  condition.  These  results  appear 
more  promising  for  the  child-reordering  model,  with  cov¬ 
erage  ranging  from  72.3%  to  85.1%  of  the  nodes,  but  we 
should  keep  in  mind  that  many  of  these  nodes  are  low  in 
the  tree  (e.g.  base  NPs);  extraction  of  1 -level  transfor¬ 
mation  rules  generally  present  no  difficulties  when  child 
nodes  are  pre-terminals,  since  any  crossings  can  be  re¬ 
solved  by  lexicalizing  the  elements  involved  in  it.  How- 
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Figure  9;  Percentage  of  parse  trees  covered  by  the  model 
given  different  constraints  on  the  maximum  size  of  the 
transformation  rules. 


Figure  1 1 :  Adverb-verb  reordering. 


4  Discussions 


Figure  10:  Same  as  Figure  9,  except  that  here  coverage  is 
evaluated  at  the  node  level. 


ever,  higher  level  syntactic  constituents  are  more  prob¬ 
lematic  for  child-reordering  models,  and  the  main  rea¬ 
sons  they  fail  to  provide  explanation  of  the  parses  at  the 
sentence  level. 

Table  1  shows  that  the  extraction  of  rules  can  be  per¬ 
formed  quite  efficiently.  Our  hrst  algorithm,  which  has  an 
exponential  running  time,  cannot  scale  to  process  large 
corpora  and  extract  a  sufficient  number  of  rules  that  a 
syntax-based  statistical  MT  system  would  require.  The 
second  algorithm,  which  runs  in  linear  time,  is  on  the 
other  hand  barely  affected  by  the  size  of  rules  it  extracts. 
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Table  1 :  Running  time  in  seconds  of  the  two  algorithms 
on  1000  sentences,  k  represent  the  maximum  size  of  rules 
to  extract. 


In  this  section,  we  present  some  syntactic  transformation 
rules  that  our  system  learns.  Fox  (2002)  identified  three 
major  causes  of  crossings  between  English  and  French: 
the  “ne  ...  pas”  construct,  modals  and  adverbs,  which  a 
child-reordering  model  doesn’t  account  for.  In  section  2, 
we  have  already  explained  how  we  learn  syntactic  rules 
involving  “ne  ...  pas”.  Here  we  describe  the  other  two 
problematic  cases. 

Figure  1 1  presents  a  frequent  cause  of  crossings  be¬ 
tween  English  and  Erench:  adverbs  in  Erench  often  ap¬ 
pear  after  the  verb,  which  is  less  common  in  English. 
Parsers  generally  create  nested  verb  phrases  when  ad¬ 
verbs  are  present,  thus  no  child  reordering  can  allow  a 
verb  and  an  adverb  to  be  permuted.  Multi-level  reodering 
as  the  rule  in  the  figure  can  prevent  crossings.  Pox’s  solu¬ 
tion  to  the  problem  of  crossings  is  to  flatten  verb  phrases. 
This  is  a  solution  for  this  sentence  pair,  since  this  ac¬ 
counts  for  adverb-verb  reorderings,  but  flattening  the  tree 
structure  is  not  a  general  solution.  Indeed,  it  can  only  ap¬ 
ply  to  a  very  limited  number  of  syntactic  categories,  for 
which  the  advantage  of  having  a  deep  syntactic  structure 
is  lost. 

Pigure  12  (dotted  lines  are  P  alignments)  shows  an  in¬ 
teresting  example  where  flattening  the  tree  structure  can¬ 
not  resolve  all  crossings  in  node-reordering  models.  In 
these  models,  a  crossing  remains  between  MD  and  AUX 
no  matter  how  VPs  are  flattened.  Our  transformation  rule 
model  creates  a  lexicalized  rule  as  shown  in  the  figure, 
where  the  transformation  of  “will  be”  into  “sera”  is  the 
only  way  to  resolve  the  crossing. 

In  the  Chinese-English  domain,  the  rules  extracted  by 
our  algorithm  often  have  the  attractive  quality  that  they 
are  the  kind  of  common-sense  constructions  that  are  used 
in  Chinese  language  textbooks  to  teach  students.  Por  in¬ 
stance,  there  are  several  that  illustrate  the  complex  re¬ 
orderings  that  occur  around  the  Chinese  marker  word 
“de.” 
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Figure  12:  Crossing  due  to  a  modal. 

5  Conclusion 

The  fundamental  assumption  underlying  much  recent 
work  in  statistical  machine  translation  (Yamada  and 
Knight,  2001;  Eisner,  2003;  Gildea,  2003)  is  that  lo¬ 
cal  transformations  (primarily  child-node  re-orderings) 
of  one-level  parent-children  substructures  are  an  adequate 
model  for  parallel  corpora.  Our  empirical  results  suggest 
that  this  may  be  too  strong  of  an  assumption.  To  explain 
the  data  in  two  parallel  corpora,  one  English-French,  and 
one  English-Chinese,  we  are  often  forced  to  learn  rules 
involving  much  larger  tree  fragments.  The  theory,  algo¬ 
rithms,  and  transformation  rules  we  learn  automatically 
from  data  have  several  interesting  aspects. 


about  the  transformations  that  one  is  permitted  to 
learn.  If  a  parser,  for  example,  makes  a  systematic 
error,  we  expect  to  learn  a  rule  that  can  neverthe¬ 
less  be  systematically  used  to  produce  correct  trans¬ 
lations. 

In  this  paper,  we  focused  on  providing  a  well-founded 
mathematical  theory  and  efficient,  linear  algorithms 
for  learning  syntactically  motivated  transformation  rules 
from  parallel  corpora.  One  can  easily  imagine  a  range 
of  techniques  for  defining  probability  distributions  over 
the  rules  that  we  learn.  We  suspect  that  such  probabilis¬ 
tic  rules  could  be  also  used  in  conjunction  with  statistical 
decoders,  to  increase  the  accuracy  of  statistical  machine 
translation  systems. 
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